Analysis for Full Corpus through Module 7 (Language Models, NLP, Vector Space Models, Similarity and Clustering, PCA)

DS 5001: Exploratory Text Analytics

Cecily Wolfe (cew4pf)

Spring 2022

LIB Table

CORPUS Table

M03: Language Models

Create Training Vocab ($V_{train}$)

Generate Training Sentences

Generate and Count Ngrams

n-gram token table

Unigram table

Bigram table

Trigram table
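The n-gram counting step can be sketched as follows. This is a minimal illustration, not the notebook's own code: the `count_ngrams` helper, the `<s>` and `</s>` padding symbols, and the toy sentences are all assumptions introduced here.

```python
from collections import Counter

def count_ngrams(sentences, n):
    """Count n-grams over tokenized sentences.

    Each sentence is padded with n-1 start symbols and one end symbol
    (hypothetical padding convention, chosen for illustration).
    """
    counts = Counter()
    for tokens in sentences:
        padded = ["<s>"] * (n - 1) + list(tokens) + ["</s>"]
        for i in range(len(padded) - n + 1):
            counts[tuple(padded[i:i + n])] += 1
    return counts

# Toy training sentences (not from the corpus)
toy_sentences = [["the", "cat", "sat"], ["the", "cat", "ran"]]
unigram_counts = count_ngrams(toy_sentences, 1)
bigram_counts = count_ngrams(toy_sentences, 2)
trigram_counts = count_ngrams(toy_sentences, 3)
```

Each `Counter` can be turned into a table with `pandas.Series(counts)` if an indexed n-gram table is preferred.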

Create language model

Generate a text with the .generate_text() method of the langmod.NgramLanguageModel object (model)

Examine redundancy for unigrams, bigrams, and trigrams $\rightarrow$ redundancy increases from unigrams to trigrams
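One common way to quantify redundancy is $R = 1 - H / H_{max}$, where $H$ is the Shannon entropy of the n-gram distribution and $H_{max} = \log_2$ of the number of distinct n-grams. The sketch below assumes that definition; the notebook's exact formula is not shown here, and the `redundancy` helper is hypothetical.

```python
import numpy as np

def redundancy(counts):
    """Return 1 - H/H_max for a vector of n-gram counts (assumed definition)."""
    p = np.asarray(counts, dtype=float)
    p = p[p > 0]                      # ignore n-grams that never occur
    if p.size < 2:
        raise ValueError("need at least two observed n-grams")
    p = p / p.sum()                   # counts -> probabilities
    h = -(p * np.log2(p)).sum()       # Shannon entropy in bits
    h_max = np.log2(p.size)           # entropy of a uniform distribution
    return 1.0 - h / h_max

uniform_r = redundancy([5, 5, 5, 5])   # evenly spread counts
skewed_r = redundancy([97, 1, 1, 1])   # one n-gram dominates
```

A uniform distribution has no redundancy, while a distribution dominated by a few n-grams is highly redundant.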

Using the bigram model represented as a matrix (too large to use BGX = model.LM[1].n.unstack() so use method below), explore the relationship between bigram pairs using the following lists for the first and second words of the bigrams of interest
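When the full bigram matrix is too large to unstack, one possible workaround is to filter the long-format counts to the words of interest first and unstack only that slice. The sketch below is an assumption about how this could be done, not the notebook's method; the index level names `w0` and `w1`, the word lists, and the toy counts are hypothetical.

```python
import pandas as pd

# Toy bigram counts in long format, indexed by (first word, second word)
bigram_counts_toy = pd.Series(
    {("he", "said"): 5, ("she", "said"): 7, ("he", "went"): 2,
     ("she", "went"): 1, ("the", "sea"): 9},
    name="n",
)
bigram_counts_toy.index.names = ["w0", "w1"]

first_words = ["he", "she"]       # first words of the bigrams of interest
second_words = ["said", "went"]   # second words of the bigrams of interest

# Keep only the rows whose two words both appear in the lists
idx = bigram_counts_toy.index
mask = (idx.get_level_values("w0").isin(first_words)
        & idx.get_level_values("w1").isin(second_words))

# Unstack just the filtered slice into a small matrix
BGX_toy = bigram_counts_toy[mask].unstack(fill_value=0)
```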

M05: Vector Space Models

Zipf's Law:

Add Term Rank $r$ to VOCAB

Alternate Rank: words that appear the same number of times are given the same rank

Compute Zipf's $k$ using term_rank and term_rank2
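Zipf's law states that frequency is roughly inversely proportional to rank, $n \approx k / r$, so $k$ can be estimated per term as $n \cdot r$. The sketch below shows both ranking schemes on a toy vocabulary; the column names `zipf_k` and `zipf_k2` and the counts are assumptions, and dense ranking is used here as one way to give tied terms the same rank.

```python
import pandas as pd

# Toy vocabulary with term counts (not from the corpus)
VOCAB_toy = pd.DataFrame(
    {"n": [100, 50, 50, 25]},
    index=pd.Index(["the", "of", "and", "to"], name="term_str"),
)

# Sort by frequency; a stable sort keeps tied terms in their original order
VOCAB_toy = VOCAB_toy.sort_values("n", ascending=False, kind="mergesort")

# term_rank: simple position in the sorted list
VOCAB_toy["term_rank"] = range(1, len(VOCAB_toy) + 1)

# term_rank2: terms with equal counts share a rank (dense ranking)
VOCAB_toy["term_rank2"] = (
    VOCAB_toy["n"].rank(method="dense", ascending=False).astype(int)
)

# Zipf's k = frequency * rank, under each ranking scheme
VOCAB_toy["zipf_k"] = VOCAB_toy["n"] * VOCAB_toy["term_rank"]
VOCAB_toy["zipf_k2"] = VOCAB_toy["n"] * VOCAB_toy["term_rank2"]
```

If Zipf's law holds, `zipf_k2` should stay roughly constant across terms.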

Rank vs. N (frequency n)

As rank (term_rank2) increases, frequency (n) decreases

BOW (Bag of Words) and TFIDF (Term Frequency - Inverse Document Frequency)

Document-Term Count Matrix DTCM
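The path from tokens to a bag of words, a document-term count matrix, and TFIDF can be sketched as below. This assumes the book is the bag, relative term frequency for TF, and $\log_2(N / DF)$ for IDF; the notebook may use a different bag level or weighting, and all `_toy` names are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy token table (not from the corpus)
TOKENS_toy = pd.DataFrame({
    "book_id": [1, 1, 1, 2, 2, 2],
    "term_str": ["whale", "sea", "whale", "sea", "ship", "ship"],
})

# BOW: count of each term in each bag
BOW_toy = TOKENS_toy.groupby(["book_id", "term_str"]).size().to_frame("n")

# DTCM: one row per bag, one column per term, zeros where a term is absent
DTCM_toy = BOW_toy["n"].unstack(fill_value=0)

# TF: term counts divided by the bag's total token count
TF = DTCM_toy.div(DTCM_toy.sum(axis=1), axis=0)

# DF and IDF: number of bags containing each term, and its log-inverse
DF = (DTCM_toy > 0).sum(axis=0)
IDF = np.log2(len(DTCM_toy) / DF)

TFIDF_toy = TF * IDF   # per bag and term
DFIDF_toy = DF * IDF   # per term, a global significance measure
```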

Reduce the number of features in VOCAB and the TFIDF matrix to the 1000 most significant terms

"Collapse" the TFIDF matrix so that it contains mean TFIDF of each term by book.
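Feature reduction and the collapse to book level can be sketched as below. The example assumes significance is measured by DFIDF and that the TFIDF rows are indexed by `(book_id, chap_id)`; both are assumptions, as are the toy values and the cutoff of 2 terms standing in for 1000.

```python
import pandas as pd

# Toy chapter-level TFIDF matrix (values invented for illustration)
TFIDF_toy = pd.DataFrame(
    [[0.0, 0.2, 0.4],
     [0.0, 0.0, 0.6],
     [0.0, 0.5, 0.0]],
    index=pd.MultiIndex.from_tuples(
        [(1, 1), (1, 2), (2, 1)], names=["book_id", "chap_id"]),
    columns=["sea", "ship", "whale"],
)

# Toy per-term significance scores (assumed to be DFIDF)
DFIDF_toy = pd.Series({"sea": 0.0, "ship": 0.9, "whale": 1.5})

# Keep only the top-k most significant terms (k=2 here, 1000 in the analysis)
top_terms = DFIDF_toy.sort_values(ascending=False).head(2).index
TFIDF_sigs_toy = TFIDF_toy[top_terms]

# Collapse: mean TFIDF of each term within each book
TFIDF_book_toy = TFIDF_sigs_toy.groupby(level="book_id").mean()
```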

Rank and TFIDF Mean

Rank and DFIDF

M06: Similarity and Clustering

Collapse Bags (to use for clustering)

Mean TFIDF for each book for all terms

Mean TFIDF for each book for the 1000 most significant terms only

DOC Table

Normalized Tables for Clustering

Create table of book pairs (doc pair table PAIRS)

Compute distance measures between all pairs of books using pdist()
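The normalization, pair table, and distance steps can be sketched together as below. The particular metrics, the toy vectors, and the column names are assumptions; `pdist()` returns distances in the same order that `itertools.combinations` enumerates the rows, which is what lets the results be assigned directly to the pair table.

```python
from itertools import combinations

import numpy as np
import pandas as pd
from scipy.spatial.distance import pdist

# Toy book-by-term matrix (not from the corpus)
X = pd.DataFrame(
    [[3.0, 4.0], [6.0, 8.0], [4.0, 3.0]],
    index=["A", "B", "C"], columns=["t1", "t2"],
)

# Normalized tables: rows scaled to unit L1 and unit L2 length
L1 = X.div(X.sum(axis=1), axis=0)
L2 = X.div(np.sqrt((X ** 2).sum(axis=1)), axis=0)

# PAIRS: one row per unordered pair of books
PAIRS = pd.DataFrame(index=pd.MultiIndex.from_tuples(
    list(combinations(X.index, 2)), names=["doc_a", "doc_b"]))

# pdist output order matches the combinations order used for the index
PAIRS["cosine"] = pdist(X.values, "cosine")
PAIRS["cityblock"] = pdist(X.values, "cityblock")
PAIRS["euclidean_l2"] = pdist(L2.values, "euclidean")
```

Books A and B point in the same direction, so their cosine distance is zero even though their raw counts differ.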

Compare Distributions

Hierarchical agglomerative cluster diagrams for the distance measures
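A minimal sketch of hierarchical agglomerative clustering with SciPy follows. Ward linkage, the Euclidean metric, and the toy points are assumptions; the notebook may use other linkage methods and distance measures. The dendrogram is computed with `no_plot=True` so the example runs without a plotting backend.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage
from scipy.spatial.distance import pdist

# Toy document vectors forming two obvious groups
X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
labels_toy = ["book_a", "book_b", "book_c", "book_d"]

# Condensed pairwise distances, then the agglomerative merge tree
Z = linkage(pdist(X, "euclidean"), method="ward")

# Cut the tree into two flat clusters
clusters = fcluster(Z, t=2, criterion="maxclust")

# Leaf order and link coordinates for the diagram (no figure is drawn)
tree = dendrogram(Z, labels=labels_toy, no_plot=True)
```

Dropping `no_plot=True` draws the diagram with matplotlib, and `orientation="left"` is often easier to read when the leaves are book titles.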

Top 20 nouns by DFIDF, sorted in descending order (including plural nouns but not proper nouns)

Most "Significant" Book based on mean TFIDF

Compare Distributions

Compare Z normalized distributions

K-Means

Algorithm Overview
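The standard K-Means procedure (Lloyd's algorithm) alternates between assigning each point to its nearest centroid and moving each centroid to the mean of its assigned points. The sketch below is a plain NumPy illustration of that loop under assumed defaults (random initialization from the data, a fixed seed), not the implementation used in the analysis.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Lloyd's algorithm: returns (labels, centroids)."""
    rng = np.random.default_rng(seed)
    # Initialize centroids as k distinct points drawn from the data
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: nearest centroid by Euclidean distance
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: mean of each cluster (keep old centroid if empty)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # converged: centroids stopped moving
        centroids = new_centroids
    return labels, centroids

# Toy data with two well-separated groups
X_toy = np.array([[0.0, 0.0], [0.2, 0.1], [9.0, 9.0], [9.1, 8.9]])
labels, centroids = kmeans(X_toy, k=2)
```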

M07: Features and Components

Manual PCA Methods with Only 1000 Most Significant Terms (excluding proper nouns)
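A manual PCA can be sketched as an eigendecomposition of the covariance matrix of the centered data. This is one common formulation and is assumed here; the notebook's manual method may differ in normalization or scaling, and the `manual_pca` helper and toy matrix are hypothetical.

```python
import numpy as np

def manual_pca(X, n_components=2):
    """PCA via eigendecomposition of the covariance matrix."""
    Xc = X - X.mean(axis=0)                  # center each feature
    cov = np.cov(Xc, rowvar=False)           # feature-by-feature covariance
    eig_vals, eig_vecs = np.linalg.eigh(cov) # eigh: for symmetric matrices
    # Sort components by descending eigenvalue and keep the top ones
    order = np.argsort(eig_vals)[::-1][:n_components]
    loadings = eig_vecs[:, order]            # term weights per component
    explained = eig_vals[order] / eig_vals.sum()
    scores = Xc @ loadings                   # document coordinates
    return scores, loadings, explained

# Toy document-by-term matrix with one dominant direction of variation
X_toy = np.array([
    [1.0, 2.0, 0.0],
    [2.0, 4.1, 0.1],
    [3.0, 5.9, 0.0],
    [4.0, 8.0, 0.1],
])
scores, loadings, explained = manual_pca(X_toy, n_components=2)
```

The columns of `loadings` show which terms drive each component, and `scores` gives the coordinates used to plot documents in component space.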

Prince PCA Method with entire TFIDF

Prince PCA with Outliers Removed

Prince PCA Method with 1000 most significant terms excluding proper nouns (TFIDF_sigs)